What is it?


Automatic Speech Recognition (ASR) is a digital communication method that 
transforms spoken discourse into written text. This rapidly evolving technology 
is used in email, text messaging, and live video captioning. Current ASR systems 
operate in conjunction with Natural Language Processing (NLP) technology to 
transform speech into text that people, and machines, can read. NLP refers 
to the methodologies and computational tools that analyze data produced in a 
natural language, such as English. 


When users talk into an ASR-enabled application, the speech signal turns into 
an audio file that is first filtered for background noise and then parsed into 
phonemes, which are the smallest sound units in a language: the word ‘push’, for 
example, has three phonemes (‘p’, ‘u’, and ‘sh’). Through statistical probability, 
the ASR system analyzes the phoneme sequences it ‘recognizes’ and deduces the 
words that best match those sound strings. The auto-generated text can then be 
‘read’ by a machine to perform some other tasks. 
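
The word-matching step described above can be illustrated with a toy sketch. It 
is not how any production ASR engine works: the lexicon, the word priors, and 
the per-phoneme match probabilities are all hypothetical values chosen for 
illustration, and the sketch assumes each phoneme is recognized independently. 

```python
# Hypothetical toy lexicon: word -> phoneme sequence (not a real pronunciation dictionary).
LEXICON = {
    "push": ["p", "u", "sh"],
    "bush": ["b", "u", "sh"],
    "pull": ["p", "u", "l"],
}

# Hypothetical prior probability of each word (a stand-in for a language model).
WORD_PRIOR = {"push": 0.5, "bush": 0.3, "pull": 0.2}


def phoneme_likelihood(heard, expected, p_match=0.9, p_mismatch=0.05):
    """Probability of hearing `heard` when the speaker intended `expected`.

    Assumes phonemes are recognized independently; sequences of different
    lengths are given probability 0 to keep the sketch short.
    """
    if len(heard) != len(expected):
        return 0.0
    prob = 1.0
    for h, e in zip(heard, expected):
        prob *= p_match if h == e else p_mismatch
    return prob


def best_word(heard):
    """Return the most probable word and the score of every candidate."""
    scores = {
        word: WORD_PRIOR[word] * phoneme_likelihood(heard, phonemes)
        for word, phonemes in LEXICON.items()
    }
    return max(scores, key=scores.get), scores


word, scores = best_word(["p", "u", "sh"])
print(word)
```

With these assumed numbers, the sequence ‘p’, ‘u’, ‘sh’ scores highest for 
‘push’ because all three phonemes match and the word has the largest prior. 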
Chapter 19. Automatic speech recognition 


Self-study is the most frequent pedagogical approach taken when integrating 
ASR into language education, as it usually mediates learner-device interactions 
instead of learner-learner exchanges. 


ASR is effectively used for pronunciation training (Pennington & Rogerson- 
Revell, 2019), but more recent uses (Istrate, 2019; Liakin, Cardoso, & Liakina, 
2015; Nickolai, 2015) show that ASR can also promote oral skills beyond 
pronunciation. 


Examples 


iSpraak.com (Nickolai, 2015), a cloud-based ASR tool, ‘listens’ to how a student 
pronounces a text provided by the teacher and returns a similarity score based on 
native speech patterns. The auto-scoring feature encourages independent study: 
learners keep practicing until they reach a certain score, but the teacher does not 
need to listen to every file produced. 
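
The idea of an automatic similarity score can be sketched as follows. This is 
not iSpraak's scoring method, which the chapter does not describe in detail: 
the sketch simply assumes that an ASR transcript of the student's recording is 
already available and compares it, word by word, with the teacher's target 
text. The function names are hypothetical. 

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[\w']+", text.lower())


def similarity_score(target_text, transcript):
    """Return a 0-100 score for how closely the transcript matches the target.

    Uses difflib's ratio over word sequences: 2 * matches / total words.
    """
    target, heard = normalize(target_text), normalize(transcript)
    return round(100 * SequenceMatcher(None, target, heard).ratio())


target = "The weather is nice today"
attempt = "the weather is ice today"  # assumed ASR output for one attempt
print(similarity_score(target, attempt))
```

A practice loop could ask the learner to record again until the score passes a 
threshold chosen by the teacher. 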


Auto-generated transcripts from speech-to-text engines such as Microsoft 
Stream can also support independent language development (Liakin et al., 
2015). As learners compare what the tool ‘understood’ to what they were 
trying to say, they improve their performance. Some of these tools pair ASR 
with automated translation, which can further help learners self-assess their 
accuracy. 
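
The comparison between the intended sentence and what the tool ‘understood’ 
can also be automated. The minimal sketch below assumes both are available as 
plain text and lists only the words that differ; the function name and the 
sample sentences are hypothetical. 

```python
from difflib import SequenceMatcher


def word_differences(intended, transcript):
    """List (operation, intended words, transcribed words) for each mismatch."""
    a, b = intended.lower().split(), transcript.lower().split()
    diffs = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":  # keep only replaced, deleted, or inserted words
            diffs.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


for diff in word_differences("she sells sea shells", "she sells see shells"):
    print(diff)
```

Each tuple points the learner to a specific word to practice again, rather 
than to the utterance as a whole. 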


An emerging ASR application is the use of Virtual Assistants (VA) such as Alexa 
or Siri (Istrate, 2019; see also Underwood, this volume). The communicative 
functions that VAs motivate include uttering commands (“Alexa, play some 
music!”) or asking factual questions (“Siri, what is the weather like in Tokyo 
today?”). Successfully getting a VA to perform the desired action or to provide 
the needed information requires not only pronunciation accuracy, but also 
some knowledge of L2 vocabulary and sentence structure: the learners are not 
reading or repeating model sentences. If the task involves asking questions 
and using the information obtained, listening comprehension is an additional 
skill practiced. 
Benefits 


Using ASR for pronunciation training may encourage learner autonomy: the 
immediate feedback provided by the software, in the form of a transcript or an 
accuracy score, makes learners more aware of their progress, and the ability to 
carry out the exercises without the teacher gives them more control over their 
practice. 


Speaking tasks with VAs also increase speaking opportunities beyond the 
classroom. VAs are not yet suitable for conversational practice, but producing 
the short action-oriented or information-seeking utterances typical in these 
tasks is still a good proficiency-building exercise that can prepare learners for 
more involved oral discourse. In fact, frequent use of VAs for independent 
practice has been linked to significant improvements in L2 speaking proficiency 
(Dizon, 2020). 


Potential issues 


An important issue in ASR’s pedagogical application is data privacy. As with 
other web-based interactions, exchanges with VAs produce personal data that 
could be commercially exploited. Thus, it is important for educators to be 
mindful of the data privacy policies for the technologies they use. 


A second concern is robustness. ASR accuracy depends heavily on the acoustic 
conditions (performance suffers in noisy environments) and, most importantly 
for language educators, the speaker’s experience with the language. Users often 
complain that the ASR tool ‘detected the wrong thing’, even though they know 
they were saying it right. 


Although ‘comprehension’ of accented speech keeps improving, ASR 
performance is still not ideal when transcribing speech produced by low- 
proficiency learners. This issue may be resolved as more data from this type of 
learner becomes available. ASR accuracy with non-native speech has improved 
due to increased computing power and data availability from commercial sources 
(telephone-based transactions, for example). These sources of data, however, do 
not include low-proficiency speakers: who dares to complete a phone transaction 
in a language they are not fluent in? 


EdTech companies offering data-based learning solutions hold the key to 
improving ASR’s robustness: tools such as Extempore are using a wide range 
of non-native deidentified speech data on their servers for research and 
development (Figure 1). 


Auto-generated transcripts that are still highly accurate with novice learners 
will be a welcome grading aid for teachers. Reading is faster than listening, 
particularly if the audio file is plagued with the long pauses typical in low- 
proficiency speech. While auto-generated fluency scores can indicate progress 
on the temporal aspects of speech (frequency and mean duration of pauses, 
percentage of speaking time), transcripts can help teachers provide feedback 
on lexical and syntactic accuracy faster. 
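
The temporal measures mentioned above can be computed from word-level 
timestamps, which many ASR engines return alongside the transcript. The sketch 
below is one possible way to do so, not a description of any particular 
product: the input format, the function name, and the 0.25-second pause 
threshold are assumptions made for illustration. 

```python
def fluency_metrics(words, min_pause=0.25):
    """Compute simple temporal fluency measures.

    `words` is a list of (word, start_sec, end_sec) tuples sorted by start
    time. Silences of at least `min_pause` seconds between words are pauses.
    """
    empty = {"pauses_per_minute": 0.0, "mean_pause_sec": 0.0, "speaking_time_pct": 0.0}
    if not words:
        return empty
    total = words[-1][2] - words[0][1]  # first word onset to last word offset
    if total <= 0:
        return empty

    # Silence between the end of one word and the start of the next.
    pauses = []
    for (_, _, prev_end), (_, next_start, _) in zip(words, words[1:]):
        gap = next_start - prev_end
        if gap >= min_pause:
            pauses.append(gap)

    speaking = sum(end - start for _, start, end in words)
    return {
        "pauses_per_minute": 60 * len(pauses) / total,
        "mean_pause_sec": sum(pauses) / len(pauses) if pauses else 0.0,
        "speaking_time_pct": 100 * speaking / total,
    }


# Hypothetical timestamps for a three-word utterance with one long pause.
sample = [("I", 0.0, 0.5), ("like", 0.5, 1.0), ("music", 2.0, 4.0)]
print(fluency_metrics(sample))
```

Tracking these values across recordings gives a rough indication of progress, 
while the transcript itself remains the basis for lexical and syntactic 
feedback. 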
Figure 1. Prototype for Extempore's ASR-enhanced features. Metadata provided 
by ASR can assist language instructors when grading oral tasks. 
Looking to the future 


The pedagogical examples described above show that ASR technology can have an 
important impact on language teaching and learning: automated comparison with 
native speech patterns encourages pronunciation accuracy, self-access speaking 
tasks promote learner autonomy, and independent oral practice with VAs builds 
proficiency. 


There is a need for increased speaking practice outside the classroom 
targeting skills beyond pronunciation. 


Through robust ASR-enabled applications, this supplemental oral 
practice can be completed without necessarily turning into additional 
grading for the teacher. Thus, as ASR with low-proficiency speakers 
becomes more reliable, this technology will be more widely adopted 
for independent and classroom-based language learning. 
Resource 


For some advice on which ASR apps to try out, see: 
https://www.techradar.com/news/best-speech-to-text-app 